Systems and methods that utilize machine learning algorithms to facilitate assembly of AIDS vaccine cocktails

ABSTRACT

The subject invention provides systems and methods that facilitate AIDS vaccine cocktail assembly via machine learning algorithms such as a cost function, a greedy algorithm, an expectation-maximization (EM) algorithm, etc. Such assembly can be utilized to generate vaccine cocktails for species of pathogens that evolve quickly under immune pressure of the host. For example, the systems and methods of the subject invention can be utilized to facilitate design of T cell vaccines for pathogens such as HIV. In addition, the systems and methods of the subject invention can be utilized in connection with other applications, such as, for example, sequence alignment, motif discovery, classification, and recombination hot spot detection. The novel techniques described herein can provide for improvements over traditional approaches to designing vaccines by constructing vaccine cocktails with higher epitope coverage, for example, in comparison with cocktails of consensi, tree nodes and random strains from data.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a Continuation-in-Part of U.S. patent application Ser. No. 10/977,415, entitled “SYSTEMS AND METHODS THAT UTILIZE MACHINE LEARNING ALGORITHMS TO FACILITATE ASSEMBLY OF AIDS VACCINE COCKTAILS,” filed Oct. 29, 2004, now abandoned. This application is also related to U.S. patent application Ser. No. 11/324,506, entitled “SYSTEMS AND METHODS THAT UTILIZE MACHINE LEARNING ALGORITHMS TO FACILITATE ASSEMBLY OF AIDS VACCINE COCKTAILS,” filed Dec. 30, 2005. The entireties of the aforementioned applications are incorporated herein by reference.

BACKGROUND

The human body has the ability to develop extremely powerful specific immunity against individual invading agents such as lethal bacteria, viruses, toxins, etc. This ability is typically referred to as acquired immunity. In general, two basic but closely allied types of acquired immunity occur in the body. In one type, the body develops circulating antibodies (produced by bursal, or B, lymphocytes), which are globulin molecules that are capable of attacking an invading agent. This type of acquired immunity is referred to as humoral immunity. The other type of acquired immunity is achieved through the formation of large numbers of activated lymphocytes (referred to as thymic, or T, lymphocytes or T cells) that are specifically designed to destroy a foreign agent. This type of immunity is called cell-mediated immunity.

Upon exposure to particular antigens, T lymphocytes of the lymphoid tissue proliferate and release large numbers of activated T cells. These T cells pass into the circulation and are distributed throughout the body, passing through the capillary walls into the tissue spaces, back into the lymph and blood once again, and circulating again and again throughout the body, sometimes lasting for months or even years. In addition, T lymphocyte memory cells are formed and preserved in the lymphoid tissue and become additional T lymphocytes of that specific clone. These additional T lymphocytes can spread throughout the lymphoid tissue of the body, and, on subsequent exposure to the same antigen, the release of activated T cells can occur far more rapidly and much more powerfully than in a first response.

Cytotoxic T cells are direct attack cells that are capable of killing microorganisms and the body's own cells and, thus, are often referred to as “killer” cells. In general, the receptor proteins on the surfaces of the cytotoxic cells cause them to bind tightly to those organisms or cells that contain their binding-specific antigen. In the instance of the Human Immunodeficiency Virus (HIV), the immune system of the infected human produces killer T cells that recognize epitopes (patterns of 8-11 amino acids) on the surface of T cells infected by HIV and bind thereto. The immediate effect of the binding is swelling of the T cell and release of cytotoxic substances into the attacked cell with eventual destruction of the cell. Cytotoxic T cells are especially lethal to tissue cells that have been invaded by viruses since many virus particles become entrapped in the membranes of these cells and attract the T cells due to viral antigenicity.

Through exposure to pathogen or pathogen-like proteins, the adaptive immune system can be primed to react to as many foreign amino acid patterns as possible, given resource and specificity constraints. Such exposure can be achieved through vaccines, which have been used for many years to cause acquired immunity against specific diseases.

Pathogen evolution typically converges to a balance between avoiding detection and preserving functionality. As the immune system has a localized effect on the pathogen's genome, the evolution will be different in different hosts and different in different parts of the pathogen's proteins. With traditional approaches to designing vaccines for rapidly evolving pathogens, evolution typically is modeled as a process of random site-independent mutations, wherein total mutation in a genome or an entire protein is assumed to capture evolutionary distance between a pair of sequences. However, the environment can affect disparate pieces of the genome and/or peptides in a protein differently. On the population level, this can lead to creation of several functional versions of each piece that are essentially arbitrarily combined into a whole protein. The combinatorial growth of functional forms of the protein creates an impression of immense diversity when mutation is averaged over the genome. Another deficiency with traditional approaches is that the log mutation scores for sites in a sequence are summed together (or mutation probabilities are all multiplied together) to define a number corresponding to an evolutionary distance between two sequences, even though separate pieces commonly have different evolutionary distances. Thus, there is a need for improved techniques that facilitate vaccine assembly.

SUMMARY

The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. This summary is not an extensive overview of the invention. It is intended to neither identify key or critical elements of the invention nor delineate the scope of the invention. Its sole purpose is to present some concepts of the invention in a simplified form as a prelude to the more detailed description that is presented later.

The subject invention provides systems and methods that facilitate vaccine cocktail assembly via machine learning techniques that model sequence diversity. Such assembly can be utilized to generate vaccine cocktails for species of pathogens that evolve quickly under immune pressure of the host. For example, the systems and methods of the subject invention can be utilized to facilitate design of T cell vaccines for pathogens such as HIV. In addition, the systems and methods of the subject invention can be utilized with other applications, such as, for example, sequence alignment, motif discovery, classification, and recombination hot spot detection.

A resultant vaccine cocktail can be referred to as an “epitome,” or a sequence that includes all or many of the short subsequences from a large set of sequence data, or population. The novel techniques described herein can provide for improvements over traditional approaches that utilize an ancestral sequence from which diversity mushroomed, an average sequence of a population, or a “best” sequence of a population. For example, vaccine cocktails generated by the systems and methods of the subject invention can provide for higher epitope coverage in comparison with the cocktails of consensi, phylogenetic tree nodes and random strains from the data. In addition, consensus models and/or phylogenetic tree models are not well-suited to accounting for the large amount of local diversity in HIV.

In one aspect, a system and/or method that determines epitomes for rapidly evolving pathogens is provided. The system can include an input component that receives a plurality of patches (e.g., sequences of DNA, RNA, or protein, etc.). Such patches can be a subset or all of a population of patches. The received patches can be variable length and conveyed by the input component to a modeling engine. The modeling engine can employ various learning algorithms (e.g., expectation-maximization (EM), greedy, Bayesian, Hidden Markov, etc.) to determine the epitome. For example, the modeling engine can determine a most likely epitome (e.g., a sequence with the greatest coverage or a shortest sequence for a particular coverage). Upon determining the epitome, it can be utilized to create a peptide and/or nucleotide sequence.

In another aspect of the subject invention, systems and methods are provided for designing an AIDS/HIV vaccine cocktail. In one instance, the methods include obtaining AIDS sequence data of contiguous amino acid subsequences (e.g., all possible subsequences with length that corresponds to a typical epitope), building a plurality of disparate sized patches from the sequence data by iteratively increasing a size of a patch while decreasing an associated free energy (e.g., set equal to zero), and aggregating patches to form the AIDS vaccine cocktail by adding a most frequent patch during each iteration unless the patch was already added. An expectation-maximization (EM) and/or a greedy algorithm can be utilized to optimize respective iterations. In another instance, the methods include receiving a plurality of HIV related sequences, utilizing the sequences, based on their linear nine-amino acid epitopes (e.g., substantially equally immunogenic), to create a compact representation of a large number of HIV related peptides, employing a machine learning algorithm to optimize the representation in terms of binding energies, and designing an HIV vaccine cocktail based on the representation. Alternatively, the representation can be estimated from the sequences by parsing the sequences into shorter peptides and creating a mosaic sequence that is longer than any individual sequence.

In yet another instance, the systems include a component that receives a plurality of HIV related nine-mers, a component that generates a sequence that epitomizes the plurality of nine-mers, a component that employs a greedy algorithm (e.g., initialized with a random nine-mer and a variable binding energy estimate) to jointly update a size of the sequence and a free energy, and a component that utilizes the updated sequence to design an HIV vaccine cocktail. Additionally or alternatively, an expectation-maximization algorithm that concurrently optimizes the updated sequence and a binding energy can be utilized.

The following description and the annexed drawings set forth in detail certain illustrative aspects of the invention. These aspects are indicative, however, of but a few of the various ways in which the principles of the invention may be employed and the present invention is intended to include all such aspects and their equivalents. Other advantages and novel features of the invention will become apparent from the following detailed description of the invention when considered in conjunction with the drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an exemplary system that employs machine learning to determine epitomes for rapidly evolving pathogens.

FIG. 2 illustrates an exemplary system that utilizes a cost function to facilitate determining epitomes.

FIG. 3 illustrates an exemplary system that utilizes an expectation-maximization (EM) algorithm to facilitate determining epitomes.

FIG. 4 illustrates an exemplary method for determining epitomes.

FIG. 5 illustrates an exemplary epitome.

FIG. 6 is a graph depicting gene coverage versus length.

FIG. 7 is a graph depicting epitope coverage versus length.

FIG. 8 illustrates an exemplary operating environment.

DETAILED DESCRIPTION

The subject invention relates to systems and methods that utilize machine learning to model sequence diversity to facilitate vaccine cocktail assembly. Suitable machine learning techniques include cost functions, expectation-maximization (EM) and greedy algorithms, for example. Such assembly can be utilized to generate vaccine cocktails for species of pathogens that evolve quickly under immune pressure of the host. For example, the systems and methods of the subject invention can be utilized to facilitate design of T cell vaccines for pathogens such as HIV.

The present invention is described with reference to the drawings, wherein like reference numerals are used to refer to like elements throughout. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It may be evident, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate describing the present invention.

As utilized herein, the terms “epitome,” “sequence,” “instance of a model,” and the like generally refer to a sequence that includes all or many of the short subsequences (patches) from a large set or population of sequence data and/or a sequence whose subsequences (patches) can be assembled to generate a wide range of representative sequences of a desired category. Suitable categories include sequences associated with a specific species, such as HIV, sequences from a specific clade, and/or sequences associated with an acute or chronic phase of infection.

FIG. 1 illustrates a system 100 that determines epitomes (vaccine cocktails) for rapidly evolving pathogens such as HIV. The system 100 comprises an input component 110 and a modeling engine 120. The input component 110 can receive a plurality of patches that can be a subset or all of a population of patches, wherein such patches can be utilized to construct an epitome. The received patches can be variable length, for example, nine-mers, ten-mers, etc. The input component 110 can convey the patches to the modeling engine 120, which can employ various learning algorithms (e.g., expectation-maximization (EM), greedy, Bayesian, Hidden Markov, etc.) that can utilize the patches to determine the epitome. For example, the modeling engine 120 can be utilized to determine a most likely epitome. In one instance, the most likely epitome is defined as the sequence with the greatest coverage. In another instance, the most likely epitome is defined as the shortest sequence for a particular coverage. Upon determining the epitome, it can be utilized to create a peptide and/or nucleotide sequence.

Traditional approaches to designing such vaccines typically model evolution as a process of random site-independent mutations. However, the environment can affect different pieces of the genome and/or peptides in a single protein differently. On the population level, this can lead to creation of several functional versions of each piece and an impression of immense diversity. In addition, with traditional approaches the log mutation scores for sites in a sequence are summed together or mutation probabilities are multiplied together to define a number corresponding to an evolutionary distance between two sequences, when separate pieces commonly have different evolutionary distances. The novel approach employed by the system 100 can provide for improvements over traditional techniques by utilizing machine learning techniques. By way of example, the system 100 can be employed to model sequence diversity to facilitate generation of vaccine cocktails. Such cocktails can provide for higher epitope coverage in comparison with the cocktails of consensi, phylogenetic tree nodes and random strains from the data.

FIG. 2 illustrates a system 200 that determines epitomes via a cost function. The system 200 comprises an input component 210, a modeling engine 220, and a learning component 230. The input component 210 can receive patches associated with a population and convey the patches to the modeling engine 220, which can utilize the patches to determine the epitome. The modeling engine 220 can employ the learning component 230 to facilitate determining the epitome.

In one aspect of the subject invention, the learning component 230 can employ a cost function 240 to learn the epitome. For example, the learning component 230 can employ a cost function 240 that measures the similarity of sequence data with an estimate of the epitome. By way of example, a set of nucleotide or amino acid patches defined by x={x_(ij)}, wherein i=1, . . . , M is a sequence index and j=1, . . . , N is a site (position) index, can be received by the input component 210 and conveyed to the modeling engine 220. The modeling engine 220 can utilize the patches to construct an M×N matrix/array of sequence data that can be input to a learning algorithm that renders the epitome as a smaller array e={e_(mn)} of size Me×Ne, wherein MeNe<<MN. For example, the data can include 12 sequences (M=12) with lengths of about 42 (N=42), whereas the epitome size after utilizing the learning algorithm can be reduced to Me=1 and Ne=50. It is to be appreciated that the values utilized in the above example are illustrative and do not limit the invention. Moreover, it is to be appreciated that the learning algorithm can optimize the epitome in order to maximize a number of short subsequences that are present in the input data, and the input data can be described by its epitome and a mapping that links the sites in the data to sites in the epitome.

In order to establish such mapping, the sequence set (patches) x can be represented as a set of short overlapping subsequences, wherein a respective subsequence x_(S) can include letters from a subset of sequence positions S. Each index in an index set S generally is two dimensional, pointing both to a sequence and a position within the sequence. These subsequences can be defined on arbitrary biological sequences. For example, if x contains M sequences of length N, then the total number of contiguous patches of length n in the data is M(N−n+1) and, thus, the number of index sets S is M(N−n+1). For each patch x_(S), its index set S can be mapped to a hidden set of epitome indices T. In many instances contiguous patches x_(S) can be assumed to map to contiguous patches e_(T) in the epitome so the set T can be identified by the first index in the set. The number of possible mappings for each patch is Me(Ne−n+1). For HIV amino acid sequence data, these subsequences generally are peptides that can correspond to epitopes. With T cell HIV vaccines, the patch length may be equal to the epitope length (e.g., 8-11 amino acids). However, the context in regions adjacent to the epitopes can affect HLA binding so the patch length may be longer, for example, up to about 33 amino acids.
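By way of a non-limiting illustration, the patch indexing described above can be sketched in Python. This is a minimal sketch rather than the implementation of the subject invention; the function name `extract_patches` and the toy sequences are assumptions introduced here. Each index set S is represented by the pair (sequence index, start position), and every start position that leaves room for a full window of length n yields one patch.

```python
from typing import Dict, List, Tuple


def extract_patches(sequences: List[str], n: int) -> Dict[Tuple[int, int], str]:
    """Map each index set S, identified by (sequence index, start), to patch x_S."""
    patches: Dict[Tuple[int, int], str] = {}
    for i, seq in enumerate(sequences):
        # A sequence of length N has N - n + 1 contiguous windows of length n.
        for j in range(len(seq) - n + 1):
            patches[(i, j)] = seq[j:j + n]
    return patches


# Hypothetical toy data: M = 2 sequences of length N = 12, patch length n = 9.
toy_sequences = ["MGARASVLSGGE", "MGARASILSGGK"]
toy_patches = extract_patches(toy_sequences, n=9)
```

Keying the patches by (sequence index, start position) keeps the two dimensional nature of each index explicit, which is convenient when a per-patch weight is later attached to each S.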

The cost function employed by the learning component 230 to optimize the epitome depends on the application. For example, a cost function that accounts for various acts that are needed to mount an effective immune response can be utilized, wherein each act can have an associated cost in the form of an energy. This energy can be viewed as a negative log-probability of an event. By way of example, a cost function can be selected to account for the acts utilized to kill an infected cell, for example, the acts needed for a vaccine e to generate an effective immune response. The vaccine generally is chopped up by cellular mechanisms and short subsequences (e.g., epitopes) are presented on the surface of the processing cell. A positive immune response happens if a clone of a T cell primed by a presented subsequence can later bind to a virus epitope x_(S) that an infected cell presents on its surface, initiating the killing of the infected cell.

In a cell processing a vaccine e, a peptide can be presented on the surface and bound to a T cell in a process with priming energy E(T). The priming energy typically is the sum of the cleavage, HLA binding, transport and/or T cell binding energies, which can influence priming of an appropriate T cell to attack a cell that presents an epitope pattern similar to e_(T). In addition, sequence data neighboring an epitope can have an impact on presentation and, thus, on the priming energy. A T cell primed with the vaccine epitope e_(T) typically attacks a cell that presents a virus epitope x_(S) in a process with attack energy E(x_(S), e_(T)). This energy depends on the cross-reactivity of the T cell. If the patch length is selected so as to account for each epitope plus its neighboring contextual sequence data, then only a piece of a window corresponding to the actual epitope can be utilized to determine the attack energy. The T cell attack energy is lowest when the epitope substantially matches the amino acid pattern on the T cell. The energy associated with priming with e_(T) and attacking x_(S) can be determined by summing the two energies E(T) and E(x_(S), e_(T)).

In general, for an effective immune response, the energy should account for the diversity of the data set (e.g., many patches from many virus sequences) and/or the pathogen's ability to rapidly evolve. In particular, the total energy typically increases for each patch from the data set that does not have a corresponding patch in the epitome that gives a low priming plus attack energy. Equation 1 provides one example of an energy E(x) that satisfies this requirement.

$$E(x) = \sum_{S} \min_{T} \left( E(T) + E(x_S, e_T) \right). \qquad \text{(Equation 1)}$$

An effective vaccine can be obtained by finding an epitome that minimizes this energy. It is to be appreciated that Equation 1 is provided for illustrative purposes and sake of brevity, and does not limit the invention.
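For illustration only, the total energy of Equation 1 can be evaluated as in the following Python sketch, which assumes a single-row epitome in which each epitome patch e_(T) is identified by its first index. The names `total_energy`, `flat_priming`, and `mismatch_attack` are hypothetical, and the two energy functions are stand-ins chosen for the example rather than measured priming or attack energies.

```python
from typing import Callable, Iterable


def total_energy(
    data_patches: Iterable[str],
    epitome: str,
    n: int,
    priming_energy: Callable[[int], float],
    attack_energy: Callable[[str, str], float],
) -> float:
    """Equation 1: sum over data patches of the lowest priming-plus-attack energy."""
    total = 0.0
    for x_s in data_patches:
        # Each epitome patch e_T is identified by its first index t.
        total += min(
            priming_energy(t) + attack_energy(x_s, epitome[t:t + n])
            for t in range(len(epitome) - n + 1)
        )
    return total


# Hypothetical stand-in energies (assumptions, not measured values).
def flat_priming(t: int) -> float:
    return 0.5


def mismatch_attack(x_s: str, e_t: str) -> float:
    return 2.0 * sum(a != b for a, b in zip(x_s, e_t))


energy = total_energy(["ABC", "CDE", "ABD"], "ABCDE", 3, flat_priming, mismatch_attack)
```

In this toy case two data patches match an epitome patch exactly and the third differs from its best epitome patch at one position, so only the third patch contributes an attack penalty.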

Each of the above energies (E(T) and E(x_(S), e_(T))) can be considered an energy associated with a stochastic process at equilibrium, wherein the energy is equal to a negative log-probability of the event or process. A suitable priming probability that can be employed in accordance with the subject invention is defined by Equation 2:

$$p(T) \propto \exp\left(-E(T)\right), \qquad \text{(Equation 2)}$$

and a suitable attack probability that can be employed in accordance with the subject invention can be defined by Equation 3:

$$p(x_S \mid e_T) \propto \exp\left(-E(x_S, e_T)\right). \qquad \text{(Equation 3)}$$

Negating and exponentiating both sides of the above equation for the total energy E(x) renders Equation 4, which is a probability of the data set x in terms of the priming and attack probabilities:

$$p(x) \propto \prod_{S} \max_{T} \left( p(x_S \mid e_T)\, p(T) \right), \qquad \text{(Equation 4)}$$

which illustrates an expression that optimizes the epitome via maximizing the likelihood of independently generating all patches from the data set, wherein patch x_(S) is generated from epitome patch e_(T) with probability p(x_(S)|e_(T)) and patch e_(T) is selected from the epitome with probability p(T).

In instances where ΔE(x_(S), e_(T)) is relatively high (e.g., except for substantially perfect matches between x_(S) and e_(T)), the total energy can be closely approximated as const − rE, wherein r is the number of the patches x_(S) that match their corresponding epitome patch e_(T) and E is the binding energy for such matches. The foregoing can be derived by letting ΔE go to infinity uniformly across mismatches. The const term can depend on ΔE and/or the total number of patches K, and typically does not depend on the fraction of the matched patches. Thus, for a given size of the epitome, the quality of the vaccine can depend only on the percentage of the matched epitopes.

An exemplary functional form that can behave in this manner in the limit involves the letter substitution probability θ. This probability can be uniformly or non-uniformly spread over any or all other possibilities (e.g., other three nucleotides in case of DNA/RNA sequence models or other nineteen amino acids in case of protein models) as illustrated in Equation 5:

$$p(x_S \mid e_T) = \theta^{|x_S \neq e_T|} (1-\theta)^{|x_S = e_T|}, \qquad \text{(Equation 5)}$$

wherein | | is the number of elements in the vector argument that are true; for example, |x_(S)≠e_(T)| is the number of elements on which the two patches disagree. As the variability parameter θ approaches zero, an exact match model, which is a conservative choice for vaccine design as it limits the assumptions on cross-reactivity, can be utilized. The binding energy model corresponding to this distribution is illustrated in Equation 6:

$$E_{x_S, e_T} = -n \log(1-\theta), \qquad \Delta E_{x_S, e_T} = \left| x_{ij} \neq e_{T(ij)} \right| \log\frac{1-\theta}{\theta}. \qquad \text{(Equation 6)}$$
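The relationship between the substitution model and its binding energy can be checked numerically. The following Python sketch is illustrative only; the function names `match_probability` and `binding_energy` and the example nine-mers are assumptions. It evaluates the probability exactly as written above, without spreading θ over the alternative letters, and it computes the energy as the base term −n log(1−θ) plus the mismatch penalty, which together equal the negative log of the probability.

```python
import math


def match_probability(x_s: str, e_t: str, theta: float) -> float:
    """Equation 5: theta per disagreeing site, (1 - theta) per agreeing site."""
    mismatches = sum(a != b for a, b in zip(x_s, e_t))
    matches = len(x_s) - mismatches
    return theta ** mismatches * (1.0 - theta) ** matches


def binding_energy(x_s: str, e_t: str, theta: float) -> float:
    """Equation 6: base energy for a patch of length n plus the mismatch penalty."""
    n = len(x_s)
    mismatches = sum(a != b for a, b in zip(x_s, e_t))
    base = -n * math.log(1.0 - theta)
    penalty = mismatches * math.log((1.0 - theta) / theta)
    return base + penalty


# Hypothetical nine-mers differing at one position; theta chosen for illustration.
prob = match_probability("SLYNTVATL", "SLFNTVATL", theta=0.1)
energy_value = binding_energy("SLYNTVATL", "SLFNTVATL", theta=0.1)
```

As θ is made smaller, the penalty per mismatch grows without bound, which is the sense in which the model approaches exact matching.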

With amino acid epitomes, the substitution parameter θ can be defined so that it decreases the probability of non-conservative amino acid exchange, thus reflecting to some extent the current understanding of the T cell cross-reactivity. The θ parameter can also be position-dependent. It is to be appreciated that there are other ways of describing the position-specific variability. For example, a full multinomial distribution over possible letters can be utilized in accordance with the subject invention. Utilizing this approach, the full multinomial distribution over possible letters, such as, for example, θA, θC, θT, θG, wherein θx is the probability of letter x at a given position and θA+θC+θT+θG=1 can be employed.

If the epitome is viewed as a stochastic model, the optimization criterion can be written as a likelihood of attacking all epitopes x_(S) as illustrated in Equation 7:

$$p(\{x_S\}) = \prod_{S} \sum_{T} p(x_S \mid e_T). \qquad \text{(Equation 7)}$$

Under a conservative assumption, wherein θ approaches zero, this cost can become equivalent to the epitome's coverage of substantially all virus epitopes. If the cost is defined in terms of the total energy barrier summed over substantially all virus epitopes x_(S), then the free energy can be defined as illustrated in Equation 8:
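Under the exact-match assumption, scoring an epitome reduces to counting the data patches that appear somewhere in it. A minimal Python sketch follows, assuming a single-row epitome and patches of equal length n; the function name `coverage` and the toy strings are hypothetical.

```python
from typing import List


def coverage(data_patches: List[str], epitome: str, n: int) -> float:
    """Fraction of data patches that occur verbatim as a window of the epitome."""
    epitome_windows = {epitome[t:t + n] for t in range(len(epitome) - n + 1)}
    matched = sum(1 for x_s in data_patches if x_s in epitome_windows)
    return matched / len(data_patches)


# Toy example: three of the four data patches occur in the epitome "ABCD".
fraction = coverage(["ABC", "BCD", "XYZ", "ABC"], "ABCD", 3)
```

Repeated data patches are counted each time they occur, so frequently observed patterns weigh more heavily in the score than rare ones.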

$$F = \sum_{S} \sum_{T} q(T \mid S) \log \frac{q(T \mid S)}{p(x_S \mid e_T)\, p(T)}, \qquad \text{(Equation 8)}$$

which combines the binding energies described above via an auxiliary distribution q(T|S) for each data patch S.

Individual patch energies −log p(x_(S)|e_(T))−log p(T) can be summed to form an estimate of the total energy barrier to the immunity against all forms of the virus if the mapping variable T is known for each sequence fragment S. However, with some probability any piece of the epitome can be chopped and presented by cellular mechanisms and utilized to prime an appropriate T cell, which could later, as a memory cell, bind to an arbitrary HIV patch x_(S). Thus, similar segments of the epitome can potentially represent a substantially similar antigen x_(S). The distribution over the epitome correspondence is expressed through q(T|S). In order to compute the average energy over all mappings, an integration under q as a measure of posterior probability of matching the data epitopes to the appropriate epitome patches can be employed. In addition, if the epitome has multiple patches that represent some data epitope x_(S), such epitome can be more effective than an epitome that has only one way of providing adaptive immunity to this epitope. Thus, the entropy of the distribution q offsets the binding energy, and the free energy of the epitome sequence can be expressed as above. It is to be appreciated that although the epitome and the viruses can go through substantially similar acts, there is no total symmetry of S and T in Equation 8 when optimizing targeting all likely targets S in the virus instead of optimizing the intersection between epitome and a set of viruses.

The free energy minimum can be equal to the negative log likelihood as illustrated in Equation 9:

$$-\log p(\{x_S\}) = \min_{q} F. \qquad \text{(Equation 9)}$$

Maximizing the likelihood with respect to the epitome e can be equivalent to minimizing the free energy with respect to the posterior distributions q(T|S) for all S and the epitome e. A suitable assignment in the posterior distribution q can require an exact match (e.g., θ=0).
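The statement that the free energy minimum equals the negative log likelihood can be verified numerically for a single data patch. The Python sketch below is an illustration under stated assumptions: a single-row epitome, a uniform p(T), the substitution model with a fixed θ, and hypothetical names `window_likelihoods` and `free_energy`. It compares the free energy at the normalized posterior with the free energy at an arbitrary (uniform) q.

```python
import math
from typing import List


def window_likelihoods(x_s: str, epitome: str, theta: float) -> List[float]:
    """p(x_S | e_T) for every epitome window, using the substitution model."""
    n = len(x_s)
    values = []
    for t in range(len(epitome) - n + 1):
        mismatches = sum(a != b for a, b in zip(x_s, epitome[t:t + n]))
        values.append(theta ** mismatches * (1.0 - theta) ** (n - mismatches))
    return values


def free_energy(q: List[float], joint: List[float]) -> float:
    """Sum over T of q(T|S) * log(q(T|S) / (p(x_S|e_T) p(T))) for one patch."""
    return sum(q_t * math.log(q_t / j_t) for q_t, j_t in zip(q, joint) if q_t > 0.0)


# Hypothetical toy patch and epitome; theta = 0.2 is an arbitrary choice.
lik = window_likelihoods("ACG", "ACGTACC", theta=0.2)
p_t = 1.0 / len(lik)                      # uniform p(T), an assumption
joint = [value * p_t for value in lik]    # p(x_S | e_T) p(T)
evidence = sum(joint)                     # sum over T of the joint
posterior = [j / evidence for j in joint]
uniform_q = [p_t] * len(lik)

f_posterior = free_energy(posterior, joint)
f_uniform = free_energy(uniform_q, joint)
```

Any q other than the posterior yields a free energy at least as large, which is why alternating between re-estimating q and re-estimating the epitome cannot increase the free energy.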

It is to be appreciated that some epitopes are known, but many are not. By studying the escapes in genes, by using databases of epitopes that are known to be immunogenic for some HLA types, or by studying the MHC/cleavage/transport binding data, the probability p(S) can be associated with each peptide x_(S) in the data, for example, according to how likely the observed pattern is to be presented on the surface of the infected cell, which is the prerequisite for the T cell immunity. If a peptide is not going to be presented, it need not be included in the epitome and the free energy is defined as illustrated in Equation 10:

$$F = \sum_{S} p(S) \sum_{T} q(T \mid S) \log \frac{q(T \mid S)}{p(x_S \mid e_T)\, p(T)}. \qquad \text{(Equation 10)}$$

Utilizing a conservative assumption (as discussed above), the vaccine optimization algorithm can be defined by Equation 11:

$$e = \lim_{\theta \rightarrow 0} \arg\min_{e} \min_{q} F. \qquad \text{(Equation 11)}$$

FIG. 3 illustrates a system 300 that determines epitomes via an expectation-maximization (EM) algorithm. The system 300 comprises an input component 310, a modeling engine 320, and a learning component 330. The input component 310 can receive patches and convey them to the modeling engine 320, which can utilize the patches to determine the epitome. The modeling engine 320 can employ the learning component 330, which can utilize a cost function 340, an EM algorithm 350, and/or a greedy algorithm 360. The modeling engine 320 can employ the EM algorithm 350 to facilitate determining the epitome. For example, by considering the size of the epitome as prescribed (e.g., by the vaccine delivery constraints) and utilizing an initial random guess for the epitome parameters, the above optimization can be performed iteratively by utilizing the EM algorithm 350.

By way of example, for each x_(S) the posterior distribution q of positions T can be estimated by Equation 12:

$$q(T \mid S) = \frac{p(x_S \mid e_T)\, p(T)}{\sum_{T} p(x_S \mid e_T)\, p(T)}. \qquad \text{(Equation 12)}$$

The epitome that minimizes the free energy can be re-estimated as illustrated in Equation 13 and Equation 14:

$$e_{mn} = \arg\max_{e_{mn}} \sum_{T(i) = (m,n)} q(T \mid S) \left[ x_{S(i)} = e_{mn} \right], \qquad \text{(Equation 13)}$$

and

$$\theta = \frac{\sum_{m,n} \sum_{S} p(S) \sum_{T(i) = (m,n)} q(T \mid S) \left[ x_{S(i)} \neq e_{mn} \right]}{\sum_{m,n} \sum_{S} p(S) \sum_{T(i) = (m,n)} q(T \mid S)}. \qquad \text{(Equation 14)}$$

Iterating these equations is an expectation maximization (EM) algorithm for the epitome model, which reduces the free energy in each act, thus converging to the local minimum of the free energy and the local maximum of the likelihood.
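The iteration described above can be sketched in Python for the simplest case. The sketch is a non-limiting illustration that assumes a single-row epitome (Me=1), patches of equal length, a uniform p(T), and p(S)=1 for every patch. The function name `em_epitome`, its parameters, and the clamping of θ to the interval [1e-6, 0.5] for numerical safety are assumptions introduced here and are not taken from the description.

```python
import random
from collections import defaultdict
from typing import List, Tuple


def em_epitome(patches: List[str], length: int, alphabet: str,
               iters: int = 20, theta: float = 0.3, seed: int = 0) -> Tuple[str, float]:
    """Alternate the posterior update with the epitome and theta updates."""
    rng = random.Random(seed)
    n = len(patches[0])
    epitome = [rng.choice(alphabet) for _ in range(length)]  # random initial guess
    starts = range(length - n + 1)
    for _ in range(iters):
        votes = [defaultdict(float) for _ in range(length)]
        posteriors = []
        for x in patches:
            # E-step: q(T|S) proportional to p(x_S|e_T); uniform p(T) cancels.
            weights = []
            for t in starts:
                mism = sum(x[i] != epitome[t + i] for i in range(n))
                weights.append(theta ** mism * (1.0 - theta) ** (n - mism))
            z = sum(weights)
            q = [w / z for w in weights]
            posteriors.append(q)
            # Accumulate posterior-weighted letter votes per epitome position.
            for t, q_t in zip(starts, q):
                for i in range(n):
                    votes[t + i][x[i]] += q_t
        # M-step for the epitome: keep the letter with the largest vote.
        for k in range(length):
            epitome[k] = max(votes[k], key=votes[k].get)
        # M-step for theta: posterior-weighted fraction of mismatching sites.
        mismatch_weight = 0.0
        total_weight = 0.0
        for x, q in zip(patches, posteriors):
            for t, q_t in zip(starts, q):
                for i in range(n):
                    total_weight += q_t
                    mismatch_weight += q_t * (x[i] != epitome[t + i])
        # Clamp theta away from 0 and above 0.5 (an assumption for stability).
        theta = min(0.5, max(1e-6, mismatch_weight / total_weight))
    return "".join(epitome), theta


# Hypothetical toy patches of length 4 and an epitome budget of 6 letters.
toy_epitome, toy_theta = em_epitome(["ACGT", "CGTA", "ACGA", "GTAC"], 6, "ACGT")
```

Because the result depends on the random initial guess, several restarts with different seeds may be compared in practice, keeping the epitome with the lowest free energy.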

The EM algorithm 350 can jointly and concurrently optimize both the epitome and the binding energy parameters θ. The algorithm can be initialized with a random epitome and a relatively large variability estimate θ. After several iterations, θ generally decreases as the epitome starts to more closely match the data and the uncertainty contracts. The energy barrier ΔE(x_(S), e_(T)) to non-exact matches can become relatively steep, capturing the conservative assumption on high T cell specificity. If the epitome is not long enough, then the algorithm decreases the allowed variability (and thus increases specificity) to a level where the balance between covering all the data and allowing for as little cross-reactivity as possible is reached for the assumed energy model. The variability can be further decreased to force the model to fit as many patches as possible without any latitude on cross-reactivity. It is to be appreciated that various other algorithms such as the greedy algorithm, Hidden Markov model, neural network, and/or Bayesian-based algorithms can be utilized in accordance with an aspect of the subject invention. For example, the greedy algorithm can be utilized to jointly update the size of the epitome sequence or sequences and the free energy in a greedy fashion.
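One simple greedy variant can be sketched as follows. This Python sketch is a deliberately simplified illustration and not the greedy algorithm 360 itself: it assumes exact matching and patches of equal length, it grows the epitome by appending the most frequent patch that is not yet present, and it stops at a length budget. It does not exploit overlaps between patches and does not track the free energy. The name `greedy_epitome` and the parameter `max_length` are hypothetical.

```python
from collections import Counter
from typing import List


def greedy_epitome(patches: List[str], max_length: int) -> str:
    """Append patches in order of decreasing frequency until the budget is reached."""
    n = len(patches[0])
    epitome = ""
    for patch, _count in Counter(patches).most_common():
        if patch in epitome:
            continue  # already covered by an earlier addition
        if len(epitome) + n > max_length:
            break     # adding another patch would exceed the length budget
        epitome += patch
    return epitome


# Toy data: "XYZ" occurs three times, "ABC" twice, "BCD" once.
toy_counts_input = ["ABC", "ABC", "BCD", "XYZ", "XYZ", "XYZ"]
short_cocktail = greedy_epitome(toy_counts_input, max_length=6)
long_cocktail = greedy_epitome(toy_counts_input, max_length=9)
```

A refinement that merges overlapping patches before appending would generally cover more patches within the same length budget.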

Optionally, an intelligence component 370 can be employed in accordance with an aspect of the invention. In one instance, the intelligence component 370 can be utilized to facilitate determining which learning algorithm to employ. For example, the machine learning component 360 can provide various cost functions, expectation-maximization algorithms, greedy algorithms, etc. as described above. The intelligence component 370 can determine which algorithm(s) should be employed, for example, based on a desired vaccine, a set of input patches, epitope length, etc. In addition, the intelligence component 370 can perform a utility-based analysis in connection with selecting an algorithm to utilize, with determining an epitome, and/or with optimizing an epitome.

In another aspect of the invention, the intelligence component 370 can perform a probabilistic and/or statistical-based analysis in connection with inferring and/or determining a suitable machine learning algorithm and/or an epitome. As utilized herein, the term “inference” and variations thereof refer generally to the process of reasoning about or inferring states of the system, environment, and/or user from a set of observations as captured via events and/or data. Inference can be employed to identify a specific context or action, or can generate a probability distribution over states, for example. The inference can be probabilistic, that is, the computation of a probability distribution over states of interest based on a consideration of data and events. Inference can also refer to techniques employed for composing higher-level events from a set of events and/or data. Such inference results in the construction of new events or actions from a set of observed events and/or stored event data, whether or not the events are correlated in close temporal proximity, and whether the events and data come from one or several event and data sources. Various classification (explicitly and/or implicitly trained) schemes and/or systems (e.g., support vector machines, neural networks, expert systems, Bayesian belief networks, fuzzy logic, data fusion engines . . . ) can be employed in connection with performing automatic and/or inferred action in connection with the subject invention.

FIG. 4 illustrates a methodology 400 that determines epitomes for pathogens such as HIV. For simplicity of explanation, the methodology is depicted and described as a series of acts. It is to be understood and appreciated that the present invention is not limited by the acts illustrated and/or by the order of acts, for example acts can occur in various orders and/or concurrently, and with other acts not presented and described herein. Furthermore, not all illustrated acts may be required to implement the methodology in accordance with the present invention. In addition, those skilled in the art will understand and appreciate that the methodology could alternatively be represented as a series of interrelated states via a state diagram or events.

At 410, a plurality of patches, or sequences, which can be a subset or all of a population of sequences, is received. Such patches can be variable length, for example, nine-mers, ten-mers, etc. At 420, various learning algorithms can be utilized to determine the epitome, based on the received sequences. For example, learning algorithms such as a cost function (as described herein), an expectation-maximization (EM) algorithm (as described herein), a greedy algorithm, Bayesian models, Hidden Markov models, neural networks, etc. can be employed in connection with various aspects of the subject invention. It is to be appreciated that the resultant epitome can be a most likely epitome such as an epitome that includes a sequence with the greatest coverage, a shortest sequence for a particular coverage, etc. At reference numeral 430, the epitome can be output. It is to be appreciated that such an epitome can be utilized to create peptide and/or nucleotide sequencing to generate an AIDS vaccine cocktail. This novel approach can provide for improvements over traditional techniques by modeling sequence diversity through machine learning. Resulting vaccines (for HIV) can provide for higher epitope coverage in comparison with the cocktails of consensi, phylogenetic tree nodes and random strains from the data.

FIG. 5 illustrates an exemplary epitome 500 and a plurality of patches (sequences) 510 that the epitome 500 epitomizes in terms of linear nine-amino acid epitopes, assuming that all nine-mers are equally immunogenic and exposure to the immune system leads to no cross-reactivity. The identified row numbers and their corresponding SEQ ID NOs are provided as follows: Row 1 (SEQ ID NO: 1); Row 2 (SEQ ID NO: 2); Row 3 (SEQ ID NO: 3); Row 4 (SEQ ID NO: 4); Row 5 (SEQ ID NO: 5); Row 6 (SEQ ID NO: 6); Row 7 (SEQ ID NO: 7); Row 8 (SEQ ID NO: 8); Row 9 (SEQ ID NO: 9); Row 10 (SEQ ID NO: 10); Row 11 (SEQ ID NO: 11); Row 12 (SEQ ID NO: 12); and Row 13 (SEQ ID NO: 13). Although nine-mers are depicted, it is to be appreciated that essentially any mer (e.g., ten-mers, eleven-mers, etc.) can be utilized in various aspects of the subject invention, and any or all assumptions can be relaxed. As illustrated at 520, 530, 540 and 550, three portions of the epitome 500 can be matched with various portions of the plurality of sequences. Such matching can be achieved by moving a window (e.g., nine-long, as depicted in FIG. 5) over the epitome 500, for example, from left to right. While moving the window, the window can be matched with corresponding sequence epitopes. The epitome 500 can be estimated from the data by chopping up the input sequences 510 into short peptides of epitope length or longer and creating a mosaic sequence longer than any given data sequence, but much shorter than the sum of all input sequence lengths. It is to be appreciated that even though it may be desirable to achieve coverage of short epitopes, due to the overlaps in these epitopes in the data, the epitome may favor conservation of long amino acid stretches from the epitomized sequences. Therefore, the epitome can also be viewed as a collection of longer or shorter protein pieces needed to compose each of the given sequences.
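The window matching described for FIG. 5 can be sketched as follows. This is a minimal Python illustration under the stated assumption of exact matching (no cross-reactivity); the function name `match_windows` and the returned layout are hypothetical, and the toy strings in the usage note are not the sequences of FIG. 5.

```python
def match_windows(epitome, sequences, k=9):
    """Slide a k-long window left to right over the epitome and record
    every exact occurrence of that window in the input sequences.

    Returns a dict mapping each epitome window start to a list of
    (sequence_index, position) pairs where the window occurs.
    """
    matches = {}
    for start in range(len(epitome) - k + 1):
        window = epitome[start:start + k]
        hits = [
            (idx, pos)
            for idx, seq in enumerate(sequences)
            for pos in range(len(seq) - k + 1)
            if seq[pos:pos + k] == window
        ]
        if hits:
            matches[start] = hits
    return matches
```

For example, with a window of length three, `match_windows("ABCDE", ["XABCX", "CDEYY"], k=3)` reports that the window starting at epitome position 0 occurs in the first sequence and the window starting at position 2 occurs in the second, while the window at position 1 has no exact match.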

FIG. 6 depicts a graph 600 that illustrates epitome coverage of a plurality of different GAG genes over length, and FIG. 7 depicts a graph 700 that illustrates epitome coverage of various epitopes of a GAG gene over length. In these figures, respective axes 610 and 710 correspond to coverage as a function of percent and respective axes 620 and 720 correspond to length. In this example, epitomes of size 1×Ne can be utilized. However, as a vaccine the epitome may need to be delivered in a different format, which can be achieved by chopping the 1×Ne epitome into smaller pieces or directly optimizing an epitome of a required format as described herein. The patches derived from the sequence data can include all possible contiguous amino acid subsequences, for example, of size nine, corresponding to the length of a typical epitope, with indices S=11, 12, . . . , 19. In order to include a context that can affect escape, the patches may need to be longer. However, optimizing for coverage of shorter patches can lead to preservation of a larger context around any or all patches due to patch overlaps both in data and in the epitome. To compute various vaccine components, an expectation-maximization (EM) algorithm, a greedy algorithm, and the like can be utilized to train a mixture of profile sequences, for example, sequences in which each site has an associated most likely letter and a probability of generating any other letter.
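Deriving the patches from the sequence data can be sketched in a few lines of Python. The sketch enumerates every contiguous subsequence of a chosen length and returns its relative frequency, which is one possible stand-in for the patch weight p(S); the function name `patch_frequencies` is hypothetical.

```python
from collections import Counter


def patch_frequencies(sequences, k=9):
    """Return {patch: relative frequency} over all contiguous k-long
    subsequences of the input sequences (sequences shorter than k
    contribute no patches)."""
    counts = Counter(
        seq[i:i + k]
        for seq in sequences
        for i in range(len(seq) - k + 1)
    )
    total = sum(counts.values())
    return {patch: n / total for patch, n in counts.items()}
```

Because neighboring patches overlap in all but one position, a patch that is covered together with its neighbors also carries its surrounding context into the epitome.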

Epitomes of various sizes can be utilized, wherein such epitomes can be constructed by iteratively increasing the size of the epitome and decreasing the free energy with the assumption θ=0, thus increasing coverage of the epitopes from the data. Respective acts can be optimal incremental moves, for example, by adding a most frequent data patch that is not yet included in the epitome. This optimization follows a conservative assumption that none of the epitopes in the sampled viruses should be a priori ignored in an effective vaccine (e.g., p(S)=const) and only an exact copy of epitope in the vaccine will lead to an effective vaccine (θ=0). Thus, the efficiency of the optimization algorithms can be evaluated by a percentage of the data patches that are exactly copied in the epitome. As discussed previously, coverage is related to the free energy and can be more intuitive when θ=0.
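A minimal Python sketch of such an incremental construction is given below, under the θ=0 assumption (exact copies only). It repeatedly appends the most frequent patch that is not yet present in the epitome and reports coverage as the fraction of data patch occurrences copied exactly. The length budget `max_len`, the suffix/prefix overlap merging, and the function names are assumptions made for illustration and are not taken from the specification.

```python
from collections import Counter


def kmers(seq, k):
    """All contiguous k-long subsequences of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def coverage(epitome, patch_counts, k):
    """Fraction of data patch occurrences exactly present in the epitome."""
    total = sum(patch_counts.values())
    if total == 0:
        return 0.0
    present = set(kmers(epitome, k))
    return sum(n for p, n in patch_counts.items() if p in present) / total


def greedy_epitome(sequences, k=9, max_len=50):
    """Grow a linear epitome by adding the most frequent missing patch."""
    counts = Counter(p for s in sequences for p in kmers(s, k))
    epitome = ""
    while True:
        present = set(kmers(epitome, k))
        missing = [p for p, _ in counts.most_common() if p not in present]
        if not missing:
            break
        best = missing[0]
        # Reuse the longest epitome suffix that equals a prefix of the patch.
        overlap = max(
            (o for o in range(min(k - 1, len(epitome)), 0, -1)
             if epitome.endswith(best[:o])),
            default=0,
        )
        if len(epitome) + k - overlap > max_len:
            break
        epitome += best[overlap:]
    return epitome, coverage(epitome, counts, k)
```

Calling `greedy_epitome` with increasing values of `max_len` yields a coverage-versus-length trade-off for this simple heuristic; it is not guaranteed to find the shortest epitome for a given coverage.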

In order to provide additional context for implementing various aspects of the present invention, FIG. 8 and the following discussion are intended to provide a brief, general description of a suitable computing environment in which the various aspects of the present invention may be implemented. While the invention has been described above in the general context of computer-executable instructions of a computer program that runs on a local computer and/or remote computer, those skilled in the art will recognize that the invention also may be implemented in combination with other program modules. Generally, program modules include routines, programs, components, data structures, etc., that perform particular tasks and/or implement particular abstract data types.

Moreover, those skilled in the art will appreciate that the inventive methods may be practiced with other computer system configurations, including single-processor or multi-processor computer systems, minicomputers, mainframe computers, as well as personal computers, hand-held computing devices, microprocessor-based and/or programmable consumer electronics, and the like, each of which may operatively communicate with one or more associated devices. The illustrated aspects of the invention may also be practiced in distributed computing environments where certain tasks are performed by remote processing devices that are linked through a communications network. However, some, if not all, aspects of the invention may be practiced on stand-alone computers. In a distributed computing environment, program modules may be located in local and/or remote memory storage devices.

With reference to FIG. 8, an exemplary environment 800 for implementing various aspects of the invention includes a computer 812. The computer 812 includes a processing unit 814, a system memory 816, and a system bus 818. The system bus 818 couples system components including, but not limited to, the system memory 816 to the processing unit 814. The processing unit 814 can be any of various available processors. Dual microprocessors and other multiprocessor architectures also can be employed as the processing unit 814.

The system bus 818 can be any of several types of bus structure(s) including the memory bus or memory controller, a peripheral bus or external bus, and/or a local bus using any variety of available bus architectures including, but not limited to, Industry Standard Architecture (ISA), Micro-Channel Architecture (MCA), Extended ISA (EISA), Integrated Drive Electronics (IDE), VESA Local Bus (VLB), Peripheral Component Interconnect (PCI), Card Bus, Universal Serial Bus (USB), Accelerated Graphics Port (AGP), Personal Computer Memory Card International Association bus (PCMCIA), Firewire (IEEE 1394), and Small Computer System Interface (SCSI).

The system memory 816 includes volatile memory 820 and nonvolatile memory 822. The basic input/output system (BIOS), containing the basic routines to transfer information between elements within the computer 812, such as during start-up, is stored in nonvolatile memory 822. By way of illustration, and not limitation, nonvolatile memory 822 can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory 820 includes random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in many forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM).

Computer 812 also includes removable/non-removable, volatile/non-volatile computer storage media. FIG. 8 illustrates, for example, a disk storage 824. Disk storage 824 includes, but is not limited to, devices like a magnetic disk drive, floppy disk drive, tape drive, Jaz drive, Zip drive, LS-100 drive, flash memory card, or memory stick. In addition, disk storage 824 can include storage media separately or in combination with other storage media including, but not limited to, an optical disk drive such as a compact disk ROM device (CD-ROM), CD recordable drive (CD-R Drive), CD rewritable drive (CD-RW Drive) or a digital versatile disk ROM drive (DVD-ROM). To facilitate connection of the disk storage devices 824 to the system bus 818, a removable or non-removable interface is typically used such as interface 826.

It is to be appreciated that FIG. 8 describes software that acts as an intermediary between users and the basic computer resources described in the suitable operating environment 800. Such software includes an operating system 828. Operating system 828, which can be stored on disk storage 824, acts to control and allocate resources of the computer system 812. System applications 830 take advantage of the management of resources by operating system 828 through program modules 832 and program data 834 stored either in system memory 816 or on disk storage 824. It is to be appreciated that the present invention can be implemented with various operating systems or combinations of operating systems.

A user enters commands or information into the computer 812 through input device(s) 836. Input devices 836 include, but are not limited to, a pointing device such as a mouse, trackball, stylus, touch pad, keyboard, microphone, joystick, game pad, satellite dish, scanner, TV tuner card, digital camera, digital video camera, web camera, and the like. These and other input devices connect to the processing unit 814 through the system bus 818 via interface port(s) 838. Interface port(s) 838 include, for example, a serial port, a parallel port, a game port, and a universal serial bus (USB). Output device(s) 840 use some of the same type of ports as input device(s) 836. Thus, for example, a USB port may be used to provide input to computer 812, and to output information from computer 812 to an output device 840. Output adapter 842 is provided to illustrate that there are some output devices 840 like monitors, speakers, and printers, among other output devices 840, which require special adapters. The output adapters 842 include, by way of illustration and not limitation, video and sound cards that provide a means of connection between the output device 840 and the system bus 818. It should be noted that other devices and/or systems of devices provide both input and output capabilities such as remote computer(s) 844.

Computer 812 can operate in a networked environment using logical connections to one or more remote computers, such as remote computer(s) 844. The remote computer(s) 844 can be a personal computer, a server, a router, a network PC, a workstation, a microprocessor based appliance, a peer device or other common network node and the like, and typically includes many or all of the elements described relative to computer 812. For purposes of brevity, only a memory storage device 846 is illustrated with remote computer(s) 844. Remote computer(s) 844 is logically connected to computer 812 through a network interface 848 and then physically connected via communication connection 850. Network interface 848 encompasses wire and/or wireless communication networks such as local-area networks (LAN) and wide-area networks (WAN). LAN technologies include Fiber Distributed Data Interface (FDDI), Copper Distributed Data Interface (CDDI), Ethernet, Token Ring and the like. WAN technologies include, but are not limited to, point-to-point links, circuit switching networks like Integrated Services Digital Networks (ISDN) and variations thereon, packet switching networks, and Digital Subscriber Lines (DSL).

Communication connection(s) 850 refers to the hardware/software employed to connect the network interface 848 to the bus 818. While communication connection 850 is shown for illustrative clarity inside computer 812, it can also be external to computer 812. The hardware/software necessary for connection to the network interface 848 includes, for exemplary purposes only, internal and external technologies such as modems including regular telephone grade modems, cable modems and DSL modems, ISDN adapters, and Ethernet cards.

As utilized in this application, terms “component,” “system,” “engine,” and the like are intended to refer to a computer-related entity, either hardware, software (e.g., in execution), and/or firmware. For example, a component can be a process running on a processor, a processor, an object, an executable, a program, and/or a computer. By way of illustration, both an application running on a server and the server can be a component. One or more components can reside within a process and a component can be localized on one computer and/or distributed between two or more computers.

What has been described above includes examples of the present invention. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the present invention, but one of ordinary skill in the art may recognize that many further combinations and permutations of the present invention are possible. Accordingly, the present invention is intended to embrace all such alterations, modifications, and variations that fall within the spirit and scope of the appended claims.

In particular and in regard to the various functions performed by the above described components, devices, circuits, systems and the like, the terms (including a reference to a “means”) used to describe such components are intended to correspond, unless otherwise indicated, to any component which performs the specified function of the described component (e.g., a functional equivalent), even though not structurally equivalent to the disclosed structure, which performs the function in the herein illustrated exemplary aspects of the invention. In this regard, it will also be recognized that the invention includes a system as well as a computer-readable medium having computer-executable instructions for performing the acts and/or events of the various methods of the invention.

In addition, while a particular feature of the invention may have been disclosed with respect to only one of several implementations, such feature may be combined with one or more other features of the other implementations as may be desired and advantageous for any given or particular application. Furthermore, to the extent that the terms “includes,” and “including” and variants thereof are used in either the detailed description or the claims, these terms are intended to be inclusive in a manner similar to the term “comprising.”

What is claimed is:
 1. A system that facilitates determining an epitome that provides a basis for a vaccine cocktail, comprising: a processing unit; a system memory coupled to the processing unit; an input component that receives a plurality of overlapping patches x_(S) corresponding to a set of subsequences from one or more pathogen sequences in a population; and a modeling engine that employs one or more machine learning algorithms to determine an epitome based on the plurality of overlapping patches x_(S), the epitome providing the basis for the vaccine cocktail, wherein the one or more machine learning algorithms comprises an expectation-maximization (EM) algorithm that includes an initial random guess for the epitome and iteratively reduces a free energy of the epitome converging to a local minimum of free energy, wherein the free energy combines T cell binding energy and HLA binding via a variational mapping distribution of respective ones of the plurality of overlapping patches x_(S) to the epitome.
 2. The system of claim 1, wherein the plurality of overlapping patches are pathogen subsequences assembled to generate representative sequences of a category.
 3. The system of claim 2, wherein the category comprises at least one of sequences associated with a specific species, sequences associated with a specific clade, sequences associated with an acute phase of infection, or sequences associated with a chronic phase of infection.
 4. The system of claim 2, wherein the category comprises HIV.
 5. The system of claim 1, wherein at least one of the pathogen subsequences relates to an epitope.
 6. The system of claim 1, wherein at least one of the pathogen subsequences comprises an epitope.
 7. The system of claim 1, wherein the plurality of overlapping patches comprise short subsequences, at least one of which contains an unknown epitope having a non-zero probability of being presented on a surface of an infected cell.
 8. The system of claim 1, wherein at least one of the one or more machine learning algorithms determines an epitome that minimizes an energy needed to mount an effective immune response as determined by a cost function.
 9. The system of claim 8, wherein the cost function determines a minimum total energy for a given length of the epitome.
 10. The system of claim 1, wherein at least one of the one or more machine learning algorithms minimizes the length of the epitome, the minimization subject to the constraint that the resulting epitome has a cost less than or equal to a given cost.
 11. The system of claim 9, wherein the total energy is based at least in part on the free energy.
 12. The system of claim 1, wherein the free energy is calculated according to at least one of: a frequency of occurrence of one of the plurality of overlapping patches x_(S) in the population; a probability that one of the plurality of overlapping patches x_(S) is found in a single strain of the population; a probability of occurrence of one of the plurality of overlapping patches x_(S) in a population wherein the sequencing data is ambiguous; a value that reflects both the frequency of one of the plurality of overlapping patches x_(S) and whether one of the plurality of overlapping patches x_(S) contains a known epitope; a probability that one of the plurality of overlapping patches x_(S) is an epitope; a probability that one of the plurality of overlapping patches x_(S) will be presented by a cell; or a probability that an individual vaccinated with one of the plurality of overlapping patches x_(S) will mount an immune response.
 13. The system of claim 11, wherein the free energy is calculated according to at least one of: a frequency of occurrence of one of the plurality of overlapping patches x_(S) in the population; a probability that one of the plurality of overlapping patches x_(S) is found in a single strain of the population; a probability of occurrence of one of the plurality of overlapping patches x_(S) in a population wherein the sequencing data is ambiguous; a value that reflects both the frequency of one of the plurality of overlapping patches x_(S) and whether one of the plurality of overlapping patches x_(S) contains a known epitope; a probability that one of the plurality of overlapping patches x_(S) is an epitope; a probability that one of the plurality of overlapping patches x_(S) will be presented by a cell; or a probability that an individual vaccinated with one of the plurality of overlapping patches x_(S) will mount an immune response.
 14. The system of claim 8, wherein the cost function measures an inverse similarity of the plurality of overlapping patches with an estimate of the epitome.
 15. The system of claim 9, wherein the cost function measures an inverse similarity of the plurality of overlapping patches with an estimate of the epitome.
 16. The system of claim 10, wherein the cost function measures an inverse similarity of the plurality of overlapping patches with an estimate of the epitome.
 17. The system of claim 8, wherein the cost function is determined according to a Hamming distance of less than a fixed integer.
 18. The system of claim 8, wherein the cost function is determined according to a probability of one of the plurality of overlapping patches x_(S) given a patch e_(T) of the epitome.
 19. The system of claim 8, wherein the cost function is determined according to a probability density function of one of the plurality of overlapping patches x_(S) given a patch e_(T) of the epitome.
 20. The system of claim 8, wherein the cost function comprises an expected fraction of the plurality of overlapping patches relating to one or more strains of the population and wherein expectation is taken over the probability that a patch contains an epitope.
 21. The system of claim 9, wherein the cost function comprises an expected fraction of the plurality of overlapping patches relating to one or more strains of the population and wherein expectation is taken over the probability that a patch contains an epitope.
 22. The system of claim 8, wherein the cost function between one of the plurality of overlapping patches x_(S) and a patch e_(T) of the epitome is an exponential of a binding energy reflecting the binding of a T-cell primed with one peptide to another peptide.
 23. The system of claim 1, the plurality of overlapping patches comprising variable length peptides.
 24. The system of claim 1, further comprising an intelligence component to optimize the epitome based on inferences.
 25. The system of claim 1, wherein the one or more machine learning algorithms model sequence diversity for the pathogen sequences in the population.
 26. The system of claim 1, wherein the epitome is an AIDS vaccine cocktail.
 27. The system of claim 1, wherein the one or more machine learning algorithms optimize the epitome by maximizing a number of short subsequences that are present in the plurality of overlapping patches.
 28. The system of claim 1, wherein the pathogen sequences are peptides and a length of at least one patch is about 8-11 amino acids.
 29. The system of claim 1, wherein the free energy is a negative log-probability of an event.
 30. Computer-executable instructions for performing a computer-implemented method to determine an epitome to facilitate vaccine design, the computer-executable instructions stored on computer-readable media, the computer-implemented method comprising: receiving a plurality of patches to one or more learning algorithms; determining an epitome based on the plurality of patches by employing an expectation-maximization (EM) algorithm that includes an initial sequence for the epitome and iteratively reduces a free energy of the epitome converging to a local minimum of free energy, wherein the free energy combines T cell binding energy and HLA binding via a variational mapping distribution of respective ones of the plurality of patches to the epitome; matching a portion of the epitome to at least one region of at least one patch by moving a window over the epitome and matching the portion of the epitome included in the window to the at least one region of the at least one patch; and utilizing the epitome to design a vaccine.
 31. The method of claim 30, wherein the vaccine is an AIDS vaccine.
 32. The method of claim 30, further comprising parsing at least one of the patches into shorter sequences of epitope length and creating a mosaic sequence that is longer than any of the shorter sequences.
 33. The method of claim 30, wherein the epitome is a mosaic sequence with a length greater than a length of any individual patch, but less than a sum of all patch lengths.
 34. The method of claim 30, wherein the EM algorithm is initialized with a large variability estimate for binding energy parameters relative to the binding energy parameters after several iterations of the EM algorithm.
 35. The method of claim 30, further comprising packing patches of a defined length into the epitome.
 36. The method of claim 31, wherein the epitome is an HIV vaccine cocktail.
 37. The method of claim 30, further comprising optimizing the epitome by maximizing a number of short subsequences that are present in the plurality of patches.
 38. A system that facilitates identifying an epitome for generating AIDS vaccine cocktails, comprising: a processing unit; a system memory coupled to the processing unit; an input component configured to receive a plurality of overlapping patches having sequences and store at least a subset of the plurality of overlapping patches in the system memory; and a machine learning component configured to employ machine learning to model sequence diversity for identifying an epitome to facilitate AIDS vaccine cocktail assembly, wherein the machine learning component comprises: a cost function that accounts for acts that are needed to mount an effective immune response; and an expectation-maximization (EM) algorithm that includes an initial random guess for the epitome and iteratively reduces a free energy of the epitome converging to a local minimum of free energy, wherein the free energy combines T cell binding energy and HLA binding via a variational mapping distribution of respective ones of the plurality of overlapping patches to the epitome.
 39. The system of claim 38, wherein the EM algorithm is initialized with a large variability estimate for binding energy parameters relative to the binding energy parameters after several iterations of the EM algorithm.
 40. The system of claim 38, wherein the cost function is determined according to a Hamming distance of less than a fixed integer.
 41. The system of claim 38, wherein the free energy is calculated according to at least one of: a frequency of occurrence of one of the plurality of overlapping patches in the population; a probability that one of the plurality of overlapping patches is found in a single strain of the population; a probability of occurrence of one of the plurality of overlapping patches in a population wherein the sequencing data is ambiguous; a value that reflects both the frequency of one of the plurality of overlapping patches and whether one of the plurality of overlapping patches contains a known epitope; a probability that one of the plurality of overlapping patches is an epitope; a probability that one of the plurality of overlapping patches will be presented by a cell; or a probability that an individual vaccinated with one of the plurality of overlapping patches will mount an immune response.
 42. The system of claim 9, wherein the cost function measures an inverse similarity of the plurality of overlapping patches x_(S) with an estimate of the epitome.
 43. The method of claim 30, wherein the free energy is calculated according to at least one of: a frequency of occurrence of a patch in the population; a probability that patch is found in a single strain of the population; a probability of occurrence of patch in a population wherein the sequencing data is ambiguous; a value that reflects both the frequency of patch and whether patch contains a known epitope; a probability that patch is an epitope; a probability that patch will be presented by a cell; or a probability that an individual vaccinated with patch will mount an immune response.
 44. The system of claim 1, wherein the free energy is calculated according to: a frequency of occurrence of one of the plurality of overlapping patches x_(S) in the population; a probability that one of the plurality of overlapping patches x_(S) is found in a single strain of the population; a probability of occurrence of one of the plurality of patches x_(S) in a population wherein the sequencing data is ambiguous; a value that reflects both the frequency of one of the plurality of overlapping patches x_(S) and whether one of the plurality of patches x_(S) contains a known epitope; a probability that one of the plurality of overlapping patches x_(S) is an epitope; a probability that one of the plurality of overlapping patches x_(S) will be presented by a cell; and a probability that an individual vaccinated with one of the plurality of overlapping patches x_(S) will mount an immune response.
 45. The system of claim 11, wherein the free energy is calculated according to: a frequency of occurrence of one of the plurality of overlapping patches x_(S) in the population; a probability that one of the plurality of overlapping patches x_(S) is found in a single strain of the population; a probability of occurrence of one of the plurality of overlapping patches x_(S) in a population wherein the sequencing data is ambiguous; a value that reflects both the frequency of one of the plurality of overlapping patches x_(S) and whether one of the plurality of overlapping patches x_(S) contains a known epitope; a probability that one of the plurality of overlapping patches x_(S) is an epitope; a probability that one of the plurality of overlapping patches x_(S) will be presented by a cell; and a probability that an individual vaccinated with one of the plurality of overlapping patches x_(S) will mount an immune response.
 46. The method of claim 30, wherein the free energy is calculated according to: a frequency of occurrence of a patch in the population; a probability that patch is found in a single strain of the population; a probability of occurrence of patch in a population wherein the sequencing data is ambiguous; a value that reflects both the frequency of patch and whether patch contains a known epitope; a probability that patch is an epitope; a probability that patch will be presented by a cell; and a probability that an individual vaccinated with patch will mount an immune response.
 47. The system of claim 1, wherein the free energy F is defined as: $F = \sum\limits_{S} \sum\limits_{T} q(T \mid S) \log \frac{q(T \mid S)}{p(x_{S} \mid e_{T})\,p(T)}$ where S is a sequence fragment, T is a hidden set of vaccine cocktail indices, e_(T) is a patch in the vaccine cocktail, and q(T|S) is the variational mapping distribution.